Library Imports
from pyspark.sql import SparkSession
from pyspark.sql import types as T
from pyspark.sql import functions as F
from datetime import datetime
from decimal import Decimal
Template
spark = (
    SparkSession.builder
    .master("local")
    .appName("Section 2.7 - Equality Statements in Spark and Comparison with Nulls")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()
)
sc = spark.sparkContext
import os
data_path = "/data/pets.csv"
base_path = os.path.dirname(os.getcwd())
path = base_path + data_path
pets = spark.read.csv(path, header=True)
pets.toPandas()
|   | id | breed_id | nickname | birthday | age | color |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | King | 2014-11-22 12:30:31 | 5 | brown | 
| 1 | 2 | 3 | Argus | 2016-11-22 10:05:10 | 10 | None | 
| 2 | 3 | 1 | Chewie | 2016-11-22 10:05:10 | 15 | None | 
| 3 | 3 | 2 | Maple | 2018-11-22 10:05:10 | 17 | white | 
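One thing worth noting before we filter: since no schema was passed to the reader, every column (including breed_id and age) is read as a string. Spark will implicitly cast strings in comparisons like F.col('breed_id') == 1 below, but you can ask the reader to guess the types up front. A minimal sketch using the standard inferSchema option (which makes Spark scan the file an extra time; pets_typed is just an illustrative name):

pets_typed = spark.read.csv(path, header=True, inferSchema=True)

# confirm the guessed types
pets_typed.printSchema()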
Filtering Data
When you want to filter data with more than one expression, there are a couple of gotchas you will need to be careful of.
Case 1: Multiple Conditions
(
    pets
    .where(
        # each condition is wrapped in its own () brackets,
        # and the conditions are combined with the bitwise & operator
        (F.col('breed_id') == 1) &
        (F.col('color') == 'brown') &
        F.col('color').isin('brown')
    )
    .toPandas()
)
|   | id | breed_id | nickname | birthday | age | color |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | King | 2014-11-22 12:30:31 | 5 | brown | 
What Happened?
When there is more than one condition, you need to wrap each condition in () brackets and combine them with Python's bitwise operators (&, |, ~) instead of the logical operators (and, or, not).
Why?
This is because Python does not let a library override the logical operators and, or, and not: they implicitly call __bool__, which must return a plain Python boolean, not a Column. Spark therefore overloads the bitwise operators &, |, and ~ on Column objects instead. And because & and | bind more tightly than comparison operators like ==, each comparison has to sit in its own brackets.
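To see why the brackets matter, here is a minimal sketch of the two failure modes (the exact error text may differ between Spark versions):

# 1. Python's logical `and` cannot be used: it implicitly calls
#    Column.__bool__, which Spark deliberately forbids.
# pets.where((F.col('breed_id') == 1) and (F.col('color') == 'brown'))
# => ValueError: Cannot convert column into bool: please use '&' for 'and',
#    '|' for 'or', '~' for 'not' ...

# 2. Without the brackets the expression parses differently, because `&`
#    binds more tightly than `==`:
# pets.where(F.col('breed_id') == 1 & F.col('color') == 'brown')
#    is read by Python as the chained comparison
#    F.col('breed_id') == (1 & F.col('color')) == 'brown'
#    whose implicit `and` again calls Column.__bool__ and raises the same error.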
Case 2: Nested Conditions
(
    pets
    .where(
        (
            F.col('breed_id').isin([1, 2]) &
            F.col('breed_id').isNotNull()
        ) |
        (F.col('color') == 'white')
    )
    .toPandas()
)
|   | id | breed_id | nickname | birthday | age | color |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | King | 2014-11-22 12:30:31 | 5 | brown | 
| 1 | 3 | 1 | Chewie | 2016-11-22 10:05:10 | 15 | None | 
| 2 | 3 | 2 | Maple | 2018-11-22 10:05:10 | 17 | white | 
What Happened?
As before, each nested group of conditions needs its own set of () brackets so the groups combine in the intended order; see the sketch below.
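The brackets also control how & and | group: in Python, & binds more tightly than |, so a & b | c means (a & b) | c. The query above happens to group the same way without the inner brackets, but relying on operator precedence is fragile. A sketch of the explicit version, pulled out into a named variable:

condition = (
    (F.col('breed_id').isin([1, 2]) & F.col('breed_id').isNotNull())
    | (F.col('color') == 'white')
)

pets.where(condition).toPandas()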
Case 3: Equality Statements with Null Values (use isNotNull() and isNull())
(
    pets
    .withColumn('result', F.col('color') != 'white')
    .withColumn(
        'result_2', 
        (F.col('color') != 'white') &
        (F.col('color').isNotNull())
    )
    .toPandas()
)
|   | id | breed_id | nickname | birthday | age | color | result | result_2 |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | King | 2014-11-22 12:30:31 | 5 | brown | True | True | 
| 1 | 2 | 3 | Argus | 2016-11-22 10:05:10 | 10 | None | None | False | 
| 2 | 3 | 1 | Chewie | 2016-11-22 10:05:10 | 15 | None | None | False | 
| 3 | 3 | 2 | Maple | 2018-11-22 10:05:10 | 17 | white | False | False | 
What Happened?
If you do not come from a SQL background, this might surprise you: any comparison with a Null value evaluates to Null, not to True or False. Notice that result is Null for the rows where color is Null. To get a definite True/False you have to use the dedicated Null checks, as result_2 does.
The two Null checks are isNull() and isNotNull().
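If you are on Spark 2.3 or later, there is also a null-safe equality test, Column.eqNullSafe(), which returns False instead of Null when one side is Null (and True when both sides are Null). A sketch:

(
    pets
    # Null <=> 'white' evaluates to False rather than Null
    .withColumn('is_white', F.col('color').eqNullSafe('white'))
    .toPandas()
)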
Summary
- In Spark, when using a more involved conditional expression, you will need to wrap each condition in () brackets and combine them with Python's bitwise operators (&, |, ~) rather than the logical operators (and, or, not).
- Be explicit when you're performing conditional transformations on columns that can be Null: use isNull() and isNotNull() (or a null-safe comparison) so Null values don't silently fall through.